12 - Artificial Intelligence II [ID:52584]

Okay, then let's start by recapping what we did last time.

We were talking about sequential decision problems, more precisely Markov decision problems.

As an example, here is our 4x3 world.

For now, we assume that our environment is fully observable and that we have a finite set of states, in this case the 11 that we see here.

But our environment is probabilistic, so which precise state we end up in after a given action is uncertain. In this particular example, if we move in one direction, there is a 20% chance that we accidentally walk in the wrong direction: 10% to the left and 10% to the right with respect to whatever direction we are currently heading.

A Markov decision process therefore consists of the following.

We have a set of states.

For every state we have a set of actions.

We have a transition model, which is just a probability distribution over successor states, given the current state and our action. And we have a reward function, which we use in lieu of a proper utility function, but from which we will later compute a utility function.
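
To make these components concrete, here is a minimal Python sketch of the 4x3 world as an MDP. The grid layout, the terminal rewards of +1 and -1, the small step reward, and all names are assumptions I make for illustration, not code from the lecture; terminal cells are treated like ordinary cells in this sketch.

ACTIONS = ["up", "down", "left", "right"]

# The 11 states as (column, row) cells; (2, 2) is the blocked cell.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]

# Reward function: assumed +1 / -1 for the two terminal cells and a small
# step reward everywhere else (the exact values are a modelling choice).
def reward(s):
    if s == (4, 3):
        return 1.0
    if s == (4, 2):
        return -1.0
    return -0.04

# Movement: 80% in the intended direction, 10% to each side of it.
LEFT_OF = {"up": "left", "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(s, direction):
    """Deterministic move; bumping into the border or the blocked cell stays put."""
    c, r = s[0] + MOVES[direction][0], s[1] + MOVES[direction][1]
    return (c, r) if (c, r) in STATES else s

def transition_model(s, a):
    """P(s' | s, a) as a dict: 0.8 intended direction, 0.1 to each side."""
    probs = {}
    for direction, p in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        s2 = step(s, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs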

The goal of a Markov decision process is to find a policy.

What is a policy?

A policy is a mapping from states to actions, and an optimal policy is one in which the action chosen in every state maximizes expected utility.

So the goal obviously is to find an optimal policy.

Here is one example of one such policy.

We've also seen that, depending on what rewards I pick for my states, the behavior of an agent that executes the optimal policy will be very different.
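
As a small sketch of what a policy looks like computationally, assuming for the moment that the utilities U of states are already known, which is precisely what we still have to compute from the rewards:

def expected_utility(s, a, U, transition_model):
    """Expected utility of doing action a in state s, given utilities U of states."""
    return sum(p * U[s2] for s2, p in transition_model(s, a).items())

def greedy_policy(U, states, actions, transition_model):
    """A policy is just a mapping (here: a dict) from states to actions; the
    optimal one picks, in every state, the action that maximizes expected utility."""
    return {s: max(actions, key=lambda a: expected_utility(s, a, U, transition_model))
            for s in states}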

How do we get from rewards to utility functions?

Well, the problem that we have is that we cannot properly assess utility functions a

priori.

In particular, we cannot just assess or observe utility functions from human behavior.

We've seen that before in the simple case.

Now, in the more complicated case, we have to take preferences over sequences of rewards, i.e. ultimately over sequences of states, into account.

We largely assume that they are stationary, which just means that my preference between two sequences does not depend on a shared initial element: if two sequences start with the same state and we prefer the first over the second, then we should also prefer the first with that initial state removed over the second with that initial state removed.
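
Written out in symbols (my notation, not necessarily the slides'), stationarity says that for state sequences with a common first state:

\[ [s_0, s_1, s_2, \dots] \succ [s_0, s_1', s_2', \dots] \iff [s_1, s_2, \dots] \succ [s_1', s_2', \dots] \]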

For stationary preferences, there are ultimately only two ways to get to a utility function. Either I just add the rewards up, or, with plain addition being a special case of this, I add them up with discounting, i.e. I introduce some discount factor gamma between 0 and 1 and weight the rewards along the sequence with increasing powers of gamma.
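
As a formula (again in my notation), the discounted utility of a state sequence is

\[ U([s_0, s_1, s_2, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \]

with plain additive rewards recovered as the special case \gamma = 1.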

That is one way to solve the problem we get with infinite lifetimes: I have to add up infinitely long sequences of states and all their rewards, which is a problem if I want to end up with some finite, fixed utility. One possible solution is to just say we cut off computation at some fixed time t. That's annoying, because suddenly my policy needs to take the remaining time into account when deciding how to act.

Alternative two: at some point my agent necessarily dies anyway, no matter which policy I execute. And of course there is variant number three, which is the one we are largely going to deal with: we introduce a discount factor. In that case, the sum over all expected rewards actually converges, because it is bounded by a geometric series.
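
Concretely, if every reward is bounded by some R_max and 0 \le \gamma < 1 (my notation), the discounted sum is dominated by a geometric series:

\[ \Big| \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \Big| \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}. \]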

Here is a list of advantages of discounted rewards. We've talked about those. Once we've
